feat: streaming in speech to text transcription #168
base: v0.4.0-rc1
Conversation
Force-pushed 72b2505 to 0f270a7
Resolve conflicts; I will test it tomorrow, but it looks OK in general.
Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update (improves or adds clarity to existing documentation)

Tested on

- [x] iOS
- [x] Android

Checklist

- [x] I have performed a self-review of my code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [ ] My changes generate no new warnings
Force-pushed 86cde69 to f6f8f31
improved error handling, minor refactoring
Force-pushed f6f8f31 to a4c262a
Force-pushed ddec73b to 135eb72
Some questions, but my main issue is including speech-to-text code in the LLM example.
Do we need audio streaming in the LLM example app? Maybe it would be more suitable in the speech-to-text app? Right now, over half of the file has nothing to do with the LLM feature it should showcase.
```ts
      return '';
    await this.encode(chunk);
  } catch (error) {
    this.onErrorCallback?.(new Error(getError(error) + ' encoding error'));
```
This regards all the `this.onErrorCallback` calls in this file. What if the dev doesn't pass `onErrorCallback` in the constructor? It would make all those errors disappear into thin air. We may use something like what is suggested in #183. Maybe even define `this.errorCallback` in the constructor, like:

```ts
this.errorCallback = (e: unknown) => {
  if (this.onErrorCallback) {
    this.onErrorCallback(getError(e));
  } else {
    throw new Error(getError(e));
  }
};
```

or some variation. We may also require passing `errorCallback`, or ignore this issue altogether, but I think that would result in poor DX.
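A standalone sketch of that fallback pattern (the `Transcriber` class name and the `getError` helper here are hypothetical stand-ins for the PR's actual code, shown only to illustrate resolving the fallback once in the constructor):

```typescript
// Hypothetical stand-in for the PR's getError: stringify unknown errors.
const getError = (e: unknown): string =>
  e instanceof Error ? e.message : String(e);

class Transcriber {
  private errorCallback: (e: unknown) => void;

  constructor(onErrorCallback?: (message: string) => void) {
    // Decide the fallback once: forward to the dev's callback if provided,
    // otherwise rethrow so errors never vanish silently.
    this.errorCallback = onErrorCallback
      ? (e) => onErrorCallback(getError(e))
      : (e) => {
          throw new Error(getError(e));
        };
  }

  report(e: unknown) {
    this.errorCallback(e);
  }
}
```

With this shape, every call site just invokes `this.errorCallback(error)` and does not need to re-check whether a callback was supplied.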
```ts
private trimLeft(numOfTokensToSlice: number) {
  for (let idx = 1; idx < this.seqs.length; idx++) {
    if (this.seqs[idx]![0] === this.config.tokenizer.bos)
      this.seqs[idx] = this.seqs[idx]!.slice(numOfTokensToSlice);
  }
}

private trimRight(numOfTokensToSlice: number) {
  for (let idx = 0; idx < this.seqs.length - 1; idx++) {
    if (this.seqs[idx]!.at(-1) === this.config.tokenizer.eos)
      this.seqs[idx] = this.seqs[idx]!.slice(0, -numOfTokensToSlice);
  }
}

private async trimSequences(audioLanguage?: string) {
  const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
    .length;
  this.trimLeft(numSpecialTokens + NUM_TOKENS_TO_SLICE);
  this.trimRight(numSpecialTokens + NUM_TOKENS_TO_SLICE);
}
```
We call `trimSequences` in the loop where we push the new seq. Wouldn't it be enough then (returning the trimmed array, since `slice` does not mutate):

```ts
private trimLeft(newSeq: number[], numOfTokensToSlice: number): number[] {
  if (newSeq[0] === this.config.tokenizer.bos)
    return newSeq.slice(numOfTokensToSlice);
  return newSeq;
}

private trimRight(newSeq: number[], numOfTokensToSlice: number): number[] {
  if (newSeq.at(-1) === this.config.tokenizer.eos)
    return newSeq.slice(0, -numOfTokensToSlice);
  return newSeq;
}

private async trimSequences(newSeq: number[], audioLanguage?: string) {
  const numSpecialTokens = (await this.getStartingTokenIds(audioLanguage))
    .length;
  newSeq = this.trimLeft(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
  return this.trimRight(newSeq, numSpecialTokens + NUM_TOKENS_TO_SLICE);
}
```
It almost feels like you can do without `trimLeft` and `trimRight` entirely and just put it all in one function.
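A single-function version of that idea could look like this (standalone sketch: `bos`, `eos`, and the slice count are plain parameters here instead of reads from `this.config`, and the trimmed sequence is returned because `slice` is non-mutating):

```typescript
// Sketch: strip special tokens off a newly decoded sequence in one pass.
// bos/eos token ids and numOfTokensToSlice are assumed inputs.
function trimSequence(
  newSeq: number[],
  bos: number,
  eos: number,
  numOfTokensToSlice: number
): number[] {
  let seq = newSeq;
  // Drop leading special tokens if the sequence starts with BOS.
  if (seq[0] === bos) seq = seq.slice(numOfTokensToSlice);
  // Drop trailing special tokens if the sequence ends with EOS.
  if (seq.at(-1) === eos) seq = seq.slice(0, -numOfTokensToSlice);
  return seq;
}
```

The caller would then do `this.seqs.push(trimSequence(seq, bos, eos, n))` instead of trimming the whole `this.seqs` array on every push.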
This code is not strictly correct, since we trimRight the second-to-last seq and trimLeft the last seq, but yes, I can simplify this.
```ts
this.waveform = this.waveform.slice(
  this.windowSize - this.overlapSeconds * Number(this.seqs.length === 0)
);
const seq = await this.decodeChunk(chunk, audioLanguage);
```
We decode it online. What happens if decoding is slower than the rate at which new samples come in? Is that a realistic case, or does it never happen?
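For reference, one common way to make that case safe is to serialize decoding through a queue, so a slow decode backs up a buffer instead of racing with incoming samples. This is a generic sketch, not the PR's code; the `DecodeQueue` name and the injected `decode` callback are hypothetical:

```typescript
// Sketch of a serialized decode queue: chunks are buffered and decoded one
// at a time, so slow decoding delays output rather than dropping samples.
class DecodeQueue<T> {
  private queue: T[] = [];
  private draining = false;

  constructor(private decode: (chunk: T) => Promise<void>) {}

  push(chunk: T) {
    this.queue.push(chunk);
    void this.drain(); // fire-and-forget; errors should be handled in decode
  }

  private async drain() {
    if (this.draining) return; // a drain loop is already running
    this.draining = true;
    while (this.queue.length > 0) {
      await this.decode(this.queue.shift()!);
    }
    this.draining = false;
  }
}
```

The trade-off is latency: if decoding is persistently slower than real time, the queue grows without bound, so in practice you would also cap its length or drop the oldest chunks.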